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Abstract 

Many real-world regression problems demand 
a measure of the uncertainty associated with 
each prediction. Standard decision forests 
deliver efficient state-of-the-art predictive per¬ 
formance, but high-quality uncertainty esti¬ 
mates are lacking. Gaussian processes (GPs) 
deliver uncertainty estimates, but scaling GPs 
to large-scale data sets comes at the cost of ap¬ 
proximating the uncertainty estimates. We ex¬ 
tend Mondrian forests, first proposed by Lak¬ 
shminarayanan et al. (2014) for classification 
problems, to the large-scale non-parametric 
regression setting. Using a novel hierarchi¬ 
cal Gaussian prior that dovetails with the 
Mondrian forest framework, we obtain princi¬ 
pled uncertainty estimates, while still retain¬ 
ing the computational advantages of decision 
forests. Through a combination of illustra¬ 
tive examples, real-world large-scale datasets, 
and Bayesian optimization benchmarks, we 
demonstrate that Mondrian forests outper¬ 
form approximate GPs on large-scale regres¬ 
sion tasks and deliver better-calibrated uncer¬ 
tainty assessments than decision-forest-based 
methods. 


1 Introduction 

Gaussian process (GP) regression is popular due to its 
ability to deliver both accurate non-parametric predic¬ 
tions and estimates of uncertainty for those predictions. 
The dominance of GP regression in applications such 
as Bayesian optimization, where uncertainty estimates 
are key to balance exploration and exploitation, is a 
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testament to the quality of GP uncertainty estimates. 

Unfortunately, the computational cost of GPs is cubic 
in the number of data points, making them computa¬ 
tionally very expensive for large scale non-parametric 
regression tasks. (Specifically, we focus on the scenario 
where the number of data points N is large, but the 
number of dimensions D is modest.) Steady progress 
has been made over the past decade on scaling GP 
inference to big data, including some impressive recent 
work such as [6, 10, 12]. 

Ensembles of randomized decision trees, also known 
as decision forests , are popular for (non-probabilistic) 
non-parametric regression tasks, often achieving state- 
of-the-art predictive performance [3]. The most popular 
decision forest variants are random forests (Breiman- 
RF) introduced by Breiman [1] and extremely random¬ 
ized trees (ERT) introduced by Geurts et al. [11]. The 
computational cost of learning decision forests is typi¬ 
cally 0(N log N) and the computation across the trees 
in the forest can be parallelized trivially, making them 
attractive for large scale regression tasks. While deci¬ 
sion forests usually yield good predictions (as measured 
by, e.g., mean squared error or classification accuracy), 
the uncertainty estimates of decision forests are not as 
good as those produced by GPs. For instance, Jitkrit- 
tum et al. [15] compare the uncertainty estimates of 
decision forests and GPs on a simple regression prob¬ 
lem where the test distributions are different from the 
training distribution. As we move away from the train¬ 
ing distribution, GP predictions smoothly return to 
the prior and exhibit higher uncertainty. However, the 
uncertainty estimates of decision forests are less smooth 
and do not exhibit this desirable property. 

Our goal is to combine the desirable properties of GPs 
(good uncertainty estimates, probabilistic setup) with 
those of decision forests (computational speed). To this 
end, we extend Mondrian forests (MFs), introduced by 
Lakshminarayanan et al. [16] for classification tasks, 
to non-parametric regression tasks. Unlike usual deci¬ 
sion forests, we use a probabilistic model within each 
tree to model the labels. Specifically, we use a hier- 
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archical Gaussian prior over the leaf node parameters 
and compute the posterior parameters efficiently using 
Gaussian belief propagation [19]. Due to special proper¬ 
ties of Mondrian processes, their use as the underlying 
randomization mechanism results in a desirable uncer¬ 
tainty property: the prediction at a test point shrinks to 
the prior as the test point moves further away from the 
observed training data points. We demonstrate that, 
as a result, MFs yield better uncertainty estimates. 

The paper is organized as follows: in Section 2, we 
briefly review Mondrian forests. We present MFs for 
regression in Section 3 and discuss inference and pre¬ 
diction in detail. We present experiments in Section 5 
that demonstrate that (i) MFs produce better uncer¬ 
tainty estimates than Breiman-RF and ERT when test 
distribution is different from training distribution, (ii) 
MFs outperform or achieve comparable performance to 
large scale approximate GPs in terms of mean squared 
error (MSE) or negative log predictive density (NLPD), 
thus making them well suited for large scale regres¬ 
sion tasks where uncertainty estimates are important, 
and (iii) MFs outperform (or perform as well as) de¬ 
cision forests on Bayesian optimization tasks, where 
predictive uncertainty is important (since it guides the 
exploration-exploitation tradeoff). Finally, we discuss 
avenues for future work in Section 6. 
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Figure 1: (left) A Mondrian tree over six data points in 
[0, l] 2 . Every node in the tree represents a split and is em¬ 
bedded in time (vertical axis), (middle) An ordinary deci¬ 
sion tree partitions the whole space, (right) In a Mondrian 
tree, splits are committed only within the range of the data 
in each block (denoted by gray rectangles). Let j = left(e) 
be the left child of the root: then Bj = (0,0.37] x (0,1] 
is the block containing the red circles and Bj C Bj is the 
smallest rectangle enclosing the two data points. (Adapted 
from [16], with permission.) 


location of the split. In particular, 

-Rieft (j) •— {a: £ Bj . xgj t sj } and 

B r \gbt(j) := { x € Bj : x$j > £j}. 

We will write Bj = (iji, wyi] x ... x (ijD, %£>], where 
^jd and Ujd denote the -tower and upper bounds, 
respectively, of the rectangular block Bj along di¬ 
mension d. Put ij = {£j i ,ij 2 , ■ ■ - and Uj = 

{uji,Uj 2 , ■ ■ ■ ,ujd}- Let N(j) denote the indices of 
training data points at node j. 


2 Mondrian forests 


2.2 Mondrian trees and Mondrian forests 


Lakshminarayanan et al. [16] introduced Mondrian 
forests (MFs) for classification tasks. For completeness, 
we briefly review decision trees and Mondrian trees 
before describing how MFs can be applied to regression. 
Our problem setup is the following: given N labeled 
examples (aq, yi), ..., (xn, Vn) € R D x R as training 
data, our task is to predict labels 1 y £ R for unlabeled 
test points x £ R D as well as provide corresponding 
estimates of uncertainty. Let Xi :n := (aq,..., x n ), 
b 1 :n * — (2/1, - - - , Vn) ^ and . — (-Xd:n, b 1 :n) • 

2.1 Decision trees 

Following [16], a decision tree is a triple (T, S, £) where 
T is a finite, rooted, strictly binary tree and 6 = (Sj) 
and £ = (£j) specify, for every internal node j g T\ 
leaves(T), a split dimension Sj £ {1, ...,D} and 
split location £j £ R. A decision tree represents a 
hierarchical partition of R D into blocks, one for each 
node, as follows: At the root node, e, we have B e = R D , 
while each internal node j £ T \ leaves(T) represents 
a split of its parent’s block into two halves, with Sj 
denoting the dimension of the split and denoting the 

2 We refer to y £ R as label even though it is common 
in statistics to refer to y £ R as response instead of label. 


A Mondrian process [24] is a continuous-time Markov 
process (Mt : t > 0), where, for every t > s > 0, Mt is 
a hierarchical binary partition of R 12 and a refinement 
of M s . Mondrian trees [16] are restrictions of Mon¬ 
drian processes to a finite set of points. (See Figure 1.) 
In particular, a Mondrian tree T is a tuple (T, £, t), 

where (T, J, £) is a decision tree and r = {r, } je j spec¬ 
ifies a split time r,- > 0 with each node j. Split times 
increase with depth, i.e., Tj > T parent (y) and play an 
important role in online updates. 

The expected depth of a Mondrian tree is parametrized 
by a non-negative lifetime parameter A > 0. Since it 
is difficult to specify A, Lakshminarayanan et al. [16] 
set A = 00 and stopped splitting a node if all the class 
labels of the data points within the node were identical. 
We follow a similar approach for regression: we do not 
split a node which has less than min_samples_split num¬ 
ber of data points. 2 Given a bound min_samples_split 
and training data Z?i ;n , the generative process for sam¬ 
pling Mondrian trees is described in Algorithms 1 and 2. 

The process is similar to top-down induction of decision 
trees except for the following key differences: (i) splits 

Specifying min_samples_split instead of max-depth is 
common in decision forests, cf. [11]. 
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Algorithm 1 SampleMondrianTree(2? 1: „, min_samples_split) 

1: Initialize: T = 0, leaves(T) = 0, S = 0, £ = 0, r = 0, N(e) = {1,2 ,n} > initialize empty tree 

2: SampleMondrianBlock(e, I? 7 v(e)) m i n - sam pl es - s plit) > Algorithm 2 


Algorithm 2 SampleMondrianBlock(j,2?jv(j)i min_samples_split) 

1: Add j to T and for all d, set £* d = min (X N ^p d ),Uj d = max(X N ^p d ) > dimension-wise min and max 

2: if |iV(j) | > min_samples_split then >j is an internal node. |IV(j)| denotes ff data points. 

3: Sample E from exponential distribution with rate Yld( u jd ~ ^jd) 

4. Set Tj ^"parent(j) T ^ 

5: Sample split dimension Sj, choosing d with probability proportional to Uj d — £^ d 

6: Sample split location uniformly from interval [£jg i u js] 

7: Set iV (I eft (j)) = {n £ N(j) : X n , Sj < &•} and N(nght(j)j = {n £ N(j) : X n , Sj > &•} 

8: SampleMondrianBlock(left(j),TV(i e ft(j)), min_samples_split) 

9: SampleMondrianBlock(right(j), D/vpightO))? min_samples_split) 

10: else > j is a leaf node 

11: Set Tj = oo and add j to leaves(T) 


in a Mondrian tree are committed only within the 
range of training data (see Figure 1), and (ii) the 
split dimensions and locations are chosen independent 
of the labels and uniformly within Bf (see lines 5, 
6 of Algorithm 2). A Mondrian forest consists of 
M i.i.d. Mondrian trees T m = (T m , S m , £ m , r m ) for 
m = 1,..., M. See [16] for further details. 

Mondrian trees can be updated online in an efficient 
manner and remarkably, the distribution of trees sam¬ 
pled from the online algorithm is identical to the cor¬ 
responding batch counterpart [16]. We use the batch 
version of Mondrian forests (Algorithms 1 and 2) in all 
of our experiments except the Bayesian optimization 
experiment in section 5.3. Since we do not evaluate the 
computational advantages of online Mondrian forest, 
using a batch Mondrian forest in the Bayesian optimiza¬ 
tion experiment would not affect the reported results. 
For completeness, we describe the online updates in 
Algorithms 3 and 4 in the supplementary material. 


a node is similar to that of its parent’s. The predictive 
pr{y\x, 2?i ; jv) is then obtained via marginalization. 

As is common in the decision tree literature, we assume 
the labels within each block are independent of X given 
the tree structure. Lakshminarayanan et al. [16] used a 
hierarchy of normalized stable processes (HNSP) prior 
for classification problems. In this paper, we focus 
on the case of real-valued labels. Let M{p, v) denote 
a Gaussian distribution with mean p and variance v. 
For every j £ T, let pj be a mean parameter (of a 
Gaussian distribution over the labels) at node j, and 
let p = {pj : j £ T}. We assume the labels within a 
leaf node are Gaussian distributed: 

Un |T, p ~ A^*(^/| ea f( 3 , Ti ) , (Ty ) (1) 

where (Ty is a parameter specifying the variance of the 
(measurement) noise. 

We use the following hierarchical Gaussian prior for p: 
For hyperparameters pu, 7 i> 72 , let 


3 Model, hierarchical prior, and 
predictive posterior for labels 

In this section, we describe a probabilistic model 
that will determine the predictive label distribution, 
px(y\x, V i : jv), for a tree T = (T, <$,£, t), dataset 2 ?i : at, 
and test point x. Let leaf(a;) denote the unique leaf 
node j £ leaves(T) such that x £ Bj. Like with Mon¬ 
drian forests for classification, we want the predictive 
label distribution at a: to be a smoothed version of 
the empirical distribution of labels for points in £>ie a f(x) 
and in Bji for nearby nodes f. We will also achieve 
this smoothing via a hierarchical Bayesian approach: 
every node is associated with a label distribution, and 
a prior is chosen under which the label distribution of 


Pe\pH ^ J\f(.PHi Pj |/GarentO') ^ -^"(/^parent(j) ) )) 

where 4>j = 7 icr( 7 2 r 7 ) - 7 i< T ( 72 T' pa '-ent(j)) with the con¬ 
vention that T parent ( e ) = 0, and a(t) = (1 + e - *) -1 
denotes the sigmoid function. 

Before discussing the details of posterior inference, we 
provide some justification for the details of this model: 
Recall that r ; increases as we go down the tree, and so 
the use of the sigmoid er(-) encodes the prior assumption 
that children are expected to be more similar to their 
parents as depth increases. The Gaussian hierarchy is 
closed under marginalization, i.e., 

p e \p H ~ A f(p H ,<t>e) 

Po\Pt,PH ~7V(/X e ,0o) 


=> Po\HH ~ Af(p H ,4>e + 4>o), 
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where (j) e + <j> 0 = 7 icr( 7 2 r e ) - 7 ict( 7 2 0 ) + 7 ict( 7 2 t 0 ) - 
7icr(7 2 T e ) = 7 i<t(7 2 to) — 7ict(7 2 0). Therefore, we can 
introduce intermediate nodes without changing the 
predictive distribution. In Section 3.3, we show that a 
test data point can branch off into its own node: the 
hierarchical prior is critical for smoothing predictions. 

Given training data V i : jv, our goal is to compute the 
posterior density over fi: 

N 

Pt(^\T>1:n) ocpt(r) (2) 

n =1 

The posterior over ji will be used during the prediction 
step described in Section 3.3. Note that the posterior 
over fi is computed independently for each tree, and 
so can be parallelized trivially. 

3.1 Gaussian belief propagation 

We perform posterior inference using belief propagation 
[21]. Since the prior and likelihood are Gaussian, all 
the messages can be computed analytically and the 
posterior over fi is also Gaussian. Since the hierarchy 
has a tree structure, the posterior can be computed in 
time that scales linearly with the number of nodes in the 
tree, which is typically 0(N ), hence posterior inference 
is efficient compared to non-tree-structured Gaussian 
processes whose computational cost is typically 0(N 3 ). 
Message passing in trees is a well-studied problem, and 
so we refer the reader to [19, Chapter 20] for details. 

3.2 Hyperparameter heuristic 

In this section, we briefly describe how we choose the 
hyperparameters 9 = {p,jj , 7 i, 72 , cr]]}. More details 
can be found in Appendix B in the supplementary ma¬ 
terial. For simplicity, we use the same values of these 
hyperparameters for all the trees; it is possible to opti¬ 
mize these parameters for each tree independently, and 
would be interesting to evaluate this extra flexibility 
empirically. Ideally, one might choose hyperparameters 
by optimizing the marginal likelihood (computed as 
a byproduct of belief propagation) by, e.g., gradient 
descent. We use a simpler approach here: we maximize 
the product of the individual label marginals, assum¬ 
ing a individual label noise, which yields closed-form 
solutions for pn and 71 . This approach does not yield 
an estimate for y 2 , and so, using the fact that r in¬ 
creases with N, we pre-process the training data to 
lie in [0, 1 ] w and set y 2 = H/(201og 2 N) based on the 
reasoning that (i) r is inversely proportional to D and 
(ii) r increases with tree depth and the tree depth is 
0(log 2 N) assuming balanced trees . 3 

'In Lakshminarayanan et al. [16], it was observed that 
the average tree depths were 2-3 times log 2 (iV) in practice. 


Lakshminarayanan et al. [16] proposed to stop splitting 
a Mondrian block whenever all the class labels were 
identical. 4 We adopt a similar strategy here and stop 
splitting a Mondrian block if the number of samples is 
fewer than a parameter min_samples_split. It is common 
in decision forests to require a minimum number of sam¬ 
ples in each leaf, for instance Breiman [1] and Geurts 
et al. [11] recommend setting mimsamplesJeaf = 5 for 
regression problems. We set min_samples_split = 10. 

3.3 Predictive variance computation 

The prediction step in a Mondrian regression tree is 
similar to that in a Mondrian classification tree [16, 
Appendix B] except that at each node of the tree, we 
predict a Gaussian posterior over y rather than pre¬ 
dict a posterior over class probabilities. Recall that 
a prediction from a vanilla decision tree is just the 
average of the training labels in leaf (a;). Unlike de¬ 
cision trees, in a Mondrian tree, a test point x can 
potentially ‘branch off’ the existing Mondrian tree at 
any point along the path from root to leaf (a:). Hence, 
the predictive posterior over y from a given tree T is a 
mixture of Gaussians of the form 

Pt(v\x,T> 1 . n )= ^2 WjN (y\mj,Vj), (3) 

j’G path (leaf (a;)) 

where Wj denotes the weight of each component, given 
by the probability of branching off just before reaching 
node j. and mj,Vj respectively denote the predictive 
mean and variance. The probability of branching off 
increases as the test point moves further away from the 
training data at that particular node; hence, the pre¬ 
dictions of MFs exhibit higher uncertainty as we move 
farther from the training data. For completeness, we 
provide pseudocode for computing (3) in Algorithm 5 
of the supplementary material. 

If a test data point branches off to create a new node, 
the predictive mean at that node is the posterior of the 
parent of the new node; if we did not have a hierarchy 
and instead assumed the predictions at leaves were i.i.d, 
then branching would result in a prediction from the 
prior, which would lead to biased predictions in most 
applications. The predictive mean and variance for the 
mixture of Gaussians are 

Ej'[y] = ^2WjVij and 

j 

Var T [y] = Y W J ( v j + W 1) _ ( E t Iv] ) 2 » ( 4 ) 

3 

technically, the Mondrian tree is paused in the online 
setting and splitting resumes once a block contains more 
than one distinct label. However, since we only deal with 
the batch setting, we stop splitting homogeneous blocks. 
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and the prediction of the ensemble is then 

p(y\x,V 1:N ) = —^2p Tm {y\x,V 1:N ). (5) 

m 

The prediction of the ensemble can be thought of as 
being drawn from a mixture model over M trees where 
the trees are weighted equally. The predictive mean 
and variance of the ensemble can be computed using 
the formula for mixture of Gaussians similar to (4). 
Similar strategy has been used in [4, 14] as well. 

4 Related work 

The work on large scale Gaussian processes can be 
broadly split into approaches that optimize induc¬ 
ing variables using variational approximations and ap¬ 
proaches that distribute computation by using experts 
that operate on subsets of the data. We refer to [6] 
for a recent summary of large scale Gaussian processes. 
Hensman et al. [12] and Gal et al. [10] use stochastic 
variational inference to speed up GPs, building on the 
variational bound developed by Titsias [28]. Deisen- 
roth and Ng [6] present the robust Bayesian committee 
machine (rBCM), which combines predictions from 
experts that operate on subsets of data. 

Hutter [13] investigated the use of Breiman-RF for 
Bayesian optimization and used the empirical variance 
between trees in the forest as a measure of uncertainty. 
(Hutter et al. [14] proposed a further modification, see 
Appendix C.) Eslami et al. [9] used a non-standard deci¬ 
sion forest implementation where a quadratic regressor 
is fit at each leaf node, rather than a constant regressor 
as in popular decision forest implementations. Their 
uncertainty measure—a sum of the Kullback-Leibler 
(KL) divergence—is highly specific to their applica¬ 
tion of accelerating expectation propagation, and so 
it seems their method is unlikely to be immediately 
applicable to general non-parametric regression tasks. 
Indeed, Jitkrittum et al. [15] demonstrate that the un¬ 
certainty estimates proposed by [9] are not as good as 
kernel methods in their application domain when the 
test distribution is different from the training distribu¬ 
tion. As originally defined, Mondrian forests produce 
uncertainty estimates for categorical labels, but Lak¬ 
shminarayanan et al. [16] evaluated their performance 
on (online) prediction (classification accuracy) without 
any assessment of the uncertainty estimates. 

5 Experiments 

5.1 Comparison of uncertainty estimates of 
MFs to popular decision forests 

In this experiment, we compare uncertainty estimates 
of MFs to those of popular decision forests. The pre¬ 


diction of MFs is given by (5), from which we can 
compute the predictive mean and predictive variance. 0 
For decision forests, we compute the predictive mean 
as the average of the predictions from the individ¬ 
ual trees and, following Hutter [13, §11.1.3], compute 
the predictive variance as the variance of the predic¬ 
tions from the individual trees. We use 25 trees and 
set mimsamplesJeaf = 5 for decision forests to make 
them comparable to MFs with min_samples_split = 10. 
We used the ERT and Breiman-RF implementation in 
scikit-learn [22] and set the remaining hyperparameters 
to their default values. 

We use a simplified version of the dataset described in 
[15], where the goal is to predict the outgoing message 
in expectation propagation (EP) from a set of incom¬ 
ing messages. When the predictions are uncertain, the 
outgoing message will be re-computed (either numeri¬ 
cally or using a sampler), hence predictive uncertainty 
is crucial in this application. Our dataset consists of 
two-dimensional features (which are derived from the 
incoming message) and a single target (corresponding 
to mean of the outgoing message). The scatter plot of 
the training data features is shown in Fig. 2(a). We 
evaluate predictive uncertainty on two test distribu¬ 
tions, shown in red and blue in Fig. 2(a), which contain 
data points in unobserved regions of the training data. 

The mean squared error of all the methods are com¬ 
parable, so we focus just on the predictive uncertainty. 
Figures 2(b), 2(c), and 2(d) display the predictive un¬ 
certainty of MF, ERT and Breiman-RF as a function 
of x\. We notice that Breiman-RF’s predictions are 
over-confident compared to MF and ERT. The predic¬ 
tive variance is quite low even in regions where training 
data has not been observed. The predictive variance 
of MF is low in regions where training data has been 
observed (—5 < X\ < 5) and goes up smoothly as we 
move farther away from the training data; the red test 
dataset is more similar to the training data than the 
blue test data and the predictive uncertainty of MF on 
the blue dataset is higher than that of the red dataset, 
as one would expect. ERT is less overconfident than 
Breiman-RF, however its predictive uncertainty is less 
smooth compared to that of MF. 

5.2 Comparison to GPs and decision forests 
on flight delay dataset 

In this experiment, we compare decision forest variants 
to large scale Gaussian processes. Deisenroth and Ng 
[6] evaluated a variety of large scale Gaussian processes 
on the flight delay dataset, processed by Hensman et al. 
[12], and demonstrate that their method achieves state- 
of-the-art predictive performance; we evaluate decision 

5 Code available from the authors’ websites. 
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(a) Distribution of train/test (b) MF uncertainty 
inputs (labels not depicted) 


(c) ERT uncertainty 


(d) Breiman-RF uncertainty 


Figure 2: (a) Scatter distribution of training distribution and test distributions, (b-d) Typical uncertainty estimates from 
a single run of MF, ERT-fc and Breiman-RF, respectively, as a function of xi. (Averaging over multiple runs would create 
smooth curves while obscuring interesting patterns in the estimates which an application would potentially suffer from.) 
As desired, MF becomes less certain away from training inputs, while the other methods report high confidence spuriously. 


forests on the same dataset so that our predictive per¬ 
formance can be directly compared to large scale GPs. 
The goal is to predict the flight delay from eight at¬ 
tributes, namely, the age of the aircraft (number of 
years since deployment), distance that needs to be cov¬ 
ered, airtime, departure time, arrival time, day of the 
week, day of the month and month. 

Deisenroth and Ng [6] employed the following strategy: 
train using the first N data points and use the following 
100,000 as test data points. Deisenroth and Ng [6] 
created three datasets, setting N to 700A', 2 M (million) 
and 5 M respectively. We use the same data splits and 
train MF, Breiman-RF, ERT on these datasets so that 
our results are directly comparable. 6 We used 10 trees 
for each forest to reduce the computational burden. 

We evaluate performance by measuring the root mean 
squared error (RMSE) and negative log predictive den¬ 
sity (NLPD). NLPD, defined as the negative logarithm 
of (5), is a popular measure for measuring predictive 
uncertainty (cf. [23, section 4.2]). NLPD penalizes 
over-confident as well as under-confident predictions 
since it not only accounts for predictive mean but also 
the predictive variance. RF and ERT do not offer a 
principled way to compute NLPD. But, as a simple 
approximation, we computed NLPD for RF and ERT 
assuming a Gaussian distribution with mean equal to 
the average of trees’ predictions, variance equal to the 
variance of trees’ predictions. 

Table 1 presents the results. The RMSE and NLPD 
results for SVI-GP, Dist-VGP and rBCM results were 
taken from [6], who report a standard error lower than 
0.3 for all of their results. Table 1 in [6] shows that 


6 Gal et al. [10] report the performance of Breiman-RF 
on these datasets, but they restricted the maximum depth 
of the trees to 2, which hurts the performance of Breiman- 
RF significantly. They also use a random train/test split, 
hence our results are not directly comparable to theirs due 
to the non-stationarity in the dataset. 


rBCM achieves significantly better performance than 
other large scale GP approximations; hence we only 
report the performance of rBCM here. It is important 
to note that the dataset exhibits non-stationarity: as 
a result, the performance of decision forests as well as 
GPs is worse on the larger datasets. (This phenomenon 
was observed by Gal et al. [10] and Deisenroth and Ng 
[6] as well.) On the 700A' and 2 M dataset, we ob¬ 
serve that decision forests achieve significantly lower 
RMSE than rBCM. MF achieves significantly lower 
NLPD compared to rBCM, which suggests that its un¬ 
certainty estimates are useful for large scale regression 
tasks. However, all the decision forests, including MFs, 
achieve poorer RMSE performance than rBCM on the 
larger 5 M dataset. We believe that this is due to the 
non-stationary nature of the data. To test this hypoth¬ 
esis, we shuffled the 5,100,000 data points to create 
three new training (test) data sets with 5 M (100A') 
points; all the decision forests achieved a RMSE in the 
range 31-34 on these shuffled datasets. 

MF outperforms rBCM in terms of NLPD on all three 
datasets. On the 5M dataset, the NLPD of Breiman- 
RF is similar to that of MF, however Breiman-RF’s 
uncertainty is not computed in a principled fashion. As 
an additional measure of uncertainty, we report prob¬ 
ability calibration measures (akin to those for binary 
classification cf. http://scikit-learn.org/stable/modules/ 
calibration.html), also known as reliability diagrams [5], 
for MF, Breiman-RF and ERT. First, we compute the 
z% (e.g. 90%) prediction interval for each test data 
point based on Gaussian quantiles using predictive 
mean and variance. Next, we measure what fraction 
of test observations fall within this prediction interval. 
For a well-calibrated regressor, the observed fraction 
should be close to z%. We compute observed fraction 
for 2 = 10% to z = 90% in increments of 10. We 
report observed fraction minus ideal fraction since it 
is easier to interpret—a value of zero implies perfect 
calibration, a negative value implies over-confidence (a 
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700K/100K 

RMSE 

NLPD 

2M/100K 

RMSE 

NLPD 

5M/100K 

RMSE 

NLPD 

SVI-GP [12] 

33.0 

- 

- 

- 

- 

- 

Dist-VGP [10] 

33.0 

- 

- 

- 

- 

- 

rBCM [6] 

27.1 

9.1 

34.4 

8.4 

35.5 

8.8 

RF 

24.07 ± 0.02 

5.06 ± 0.02* 

27.3 ± 0.01 

5.19 ± 0.02* 

39.47 ± 0.02 

6.90 ± 0.05* 

ERT 

24.32 ± 0.02 

6.22 ± 0.03* 

27.95 ± 0.02 

6.16 ± 0.01* 

38.38 ± 0.02 

8.41 ± 0.09* 

MF 

26.57 ± 0.04 

4.89 ± 0.02 

29.46 ± 0.02 

4.97 ± 0.01 

40.13 ± 0.05 

6.91 ± 0.06 


Table 1: Comparison of MFs to popular decision forests and large scale GPs on the flight delay dataset. We report 
average results over 3 runs (with random initializations), along with standard errors. MF achieves significantly better 
NLPD than rBCM. RF and ERT do not offer a principled way to compute NLPD, hence they are marked with an asterix. 


Dataset 

Method 

0.1 

0.2 

0.3 

0.4 

0.5 

0.6 

0.7 

0.8 

0.9 

700K 

Breiman-RF 

-0.02 

-0.04 

-0.05 

-0.06 

-0.06 

-0.06 

-0.05 

-0.06 

-0.07 

700K 

ERT 

-0.04 

-0.07 

-0.11 

-0.14 

-0.16 

-0.18 

-0.19 

-0.19 

-0.18 

700K 

MF 

-0.01 

-0.02 

-0.01 

0 

0.02 

0.03 

0.03 

0.02 

0 

2M 

Breiman-RF 

-0.02 

-0.04 

-0.05 

-0.06 

-0.05 

-0.04 

-0.03 

-0.03 

-0.04 

2M 

ERT 

-0.04 

-0.08 

-0.12 

-0.15 

-0.17 

-0.18 

-0.19 

-0.18 

-0.16 

2M 

MF 

-0.02 

-0.04 

-0.05 

-0.05 

-0.03 

0 

0.02 

0.03 

0.01 

5M 

Breiman-RF 

-0.03 

-0.06 

-0.08 

-0.09 

-0.1 

-0.1 

-0.11 

-0.1 

-0.1 

5M 

ERT 

-0.04 

-0.07 

-0.11 

-0.14 

-0.16 

-0.18 

-0.19 

-0.19 

-0.18 

5M 

MF 

-0.02 

-0.04 

-0.05 

-0.06 

-0.06 

-0.05 

-0.05 

-0.05 

-0.07 


Table 2: Comparison of MFs to popular decision forests on the flight delay dataset. Each entry denotes the difference 
between the observed fraction minus the ideal fraction (which is shown at the top of the column). Hence, a value of zero 
implies perfect calibration, a negative value implies overconfidence and a positive value implies under-confident predictor. 
MF is better calibrated than Breiman-RF and ERT, which are consistently over-confident. 


lot more observations lie outside the prediction interval 
than expected) and a positive value implies under- 
confidence. The results are shown in Table 2. MF is 
clearly better calibrated than Breiman-RF and ERT, 
which seem to be consistently over-confident. Since 
5M dataset exhibits non-stationarity, MF appears to 
be over-confident but still outperforms RF and ERT. 
Deisenroth and Ng [6] do not report calibration mea¬ 
sures and their code is not available publicly, hence we 
do not report calibration measures for GPs. 

5.3 Scalable Bayesian optimization 

Next, we showcase the usefulness of MFs in a Bayesian 
optimization (BayesOpt) task. We briefly review the 
Bayesian optimization setup for completeness and re¬ 
fer the interested reader to [2, 25] for further details. 
Bayesian optimization deals with the problem of iden¬ 
tifying the global maximizer (or minimizer) of an un¬ 
known (a.k.a. black-box) objective function which is 
computationally expensive to evaluate.' Our goal is to 
identify the maximizer in as few evaluations as possi¬ 
ble. Bayesian optimization is a model-based sequential 
search approach to solve this problem. Specifically, 

7 For a concrete example, consider the task of optimizing 
the hyperparameters of a deep neural network to maximize 
validation accuracy. 


given n noisy observations, we fit a surrogate model 
such as a Gaussian process or a decision forest and 
choose the next location based on an acquisition func¬ 
tion such as upper confidence bound (UCB) [27] or ex¬ 
pected improvement (El) [18]. The acquisition function 
trades off exploration versus exploitation by choosing 
input locations where the predictive mean from the 
surrogate model is high (exploitation) or the predictive 
uncertainty of the surrogate model is high (exploration). 
Hence, a surrogate model with well-calibrated predic¬ 
tive uncertainty is highly desirable. Moreover, the 
surrogate model has to be re-fit after every new obser¬ 
vation is added; while this is not a significant limitation 
for a few (e.g. 50) observations and scenarios where 
the evaluation of the objective function is significantly 
more expensive than re-fitting the surrogate model, the 
re-fitting can be computationally expensive if one is 
interested in scalable Bayesian optimization [26]. 

Hutter [13] proposed sequential model-based algorithm 
configuration (SMAC), which uses Breiman-RF as the 
surrogate model and the uncertainty between the trees 
as a heuristic measure of uncertainty. 8 Nickson et al. 

s Hutter et al. [14, §4.3.2] proposed a further modifi¬ 
cation to the variance estimation procedure, where each 
tree outputs a predictive mean and variance, in the spirit 
of quantile regression forests [17]. See Appendix C for a 
discussion on how this relates to MFs. 
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[20] discuss a scenario where this heuristic produces 
misleading uncertainty estimates that hinders explo¬ 
ration. It is worth noting that SMAC uses El as the 
acquisition function only 50% of the time and uses 
random search the remaining 50% of the time (which 
is likely due to the fact that the heuristic predictive 
uncertainty can collapse to 0). Moreover, SMAC re-fits 
the surrogate model by running a batch algorithm; the 
computational complexity of running the batch version 
N times is £^ =1 £>(nlogn) = 0(N 2 \ogN) [16]. 

MFs are desirable for such an application since they can 
produce principled uncertainty estimates and can be ef¬ 
ficiently updated online with computational complexity 
J2n=i ^(log 71 ) = 0(N log N). Note that the cost of 
updating the Mondrian tree structure is O(logn), how¬ 
ever exact message passing costs 0(n). To maintain 
the 0(\ogn) cost, we approximate the Gaussian pos¬ 
terior at each node by a Gaussian distribution whose 
mean and variance are given by the empirical mean and 
variance of the data points at that node. Adding a new 
data point involves just updating mean and variance 
for all the nodes along the path from root to a leaf, 
hence the overall cost is O(logn). (See Appendix C.) 

We report results on four Bayesian optimization bench¬ 
marks used in [7, 26] , consisting of two synthetic func¬ 
tions namely the Branin and Hartmann functions, and 
two real-world problems, namely optimizing the hyper¬ 
parameters of online latent Dirichlet allocation (LDA) 
and structured support vector machine (SVM). LDA 
and SVM datasets consist of 288 and 1400 grid points 
respectively; we sampled Branin and Hartmann func¬ 
tions at 250,000 grid points (to avoid implementing a 
separate optimizer for optimizing over the acquisition 
function). For SVM and LDA, some dimensions of the 
grid vary on a non-linear scale (e.g. 10°, 10 _1 ,10 -2 ); 
we log-transformed such dimensions and scaled all di¬ 
mensions to [0,1] so that all features are on the same 
scale. We used 10 trees, set min_samples_split = 2 and 
use UCB as the acquisition function 9 for MFs. We re¬ 
peat our results 15 times (5 times each with 3 different 
random grids for Branin and Hartmann) and report 
mean and standard deviation. 

Following Eggensperger et al. [7], we evaluate a fixed 
number of evaluations for each benchmark and measure 
the maximum value observed. The results are shown in 
Table 3. The SMAC results (using Breiman-RF) were 
taken from Table 2 of [7]. Both MF and SMAC identify 
the optimum for LDA-grid. SMAC does not identify 
the optimum for Branin and Hartmann functions. We 
observe that MF finds maxima very close to the true 
maximum on the grid, thereby suggesting that better 
uncertainty estimates are useful for better exploration- 

9 Specifically, we set acquisition function = predictive 
mean + predictive standard deviation. 


Dataset (D, #evals) 

Oracle 

MF 

SMAC [7] 

Branin (2, 200) 

-0.398 

-0.400 ± 0.005 

-0.655 ± 0.27 

Hartmann (6, 200) 

3.322 

3.247 ± 0.109 

2.977 ± 0.11 

SVM-grid (3, 100) 

-1266.2 

-1266.36 ± 0.52 

-1269.6 ± 2.9 

LDA-grid (3, 50) 

-24.1 

-24.1 ± 0 

-24.1± 0.1 


Table 3: Results on BayesOpt benchmarks: Oracle re¬ 
ports the maximum value on the grid. MF, RF report the 
maximum value obtained by the respective methods. 

exploitation tradeoff. The computational advantage of 
MFs might not be significant with few evaluations, but 
we expect MFs to be computationally advantageous 
in applications with thousands of observations, e.g., 
scalable Bayesian optimization [26] and reinforcement 
learning [8]. 

5.4 Failure modes of our approach 

No method is a panacea: here we discuss two failure 
modes of our approach that would be important to ad¬ 
dress in future work. First, we expect CPs to perform 
better than decision forests on extrapolation tasks; a 
GP with an appropriate kernel (and well-estimated 
hyperparameter values) can extrapolate beyond the 
observed range of training data; however, the predic¬ 
tions of decision forests with constant predictors at leaf 
nodes are confined to the range of minimum and maxi¬ 
mum observed y. If extrapolation is desired, we need 
complex regressors (that are capable of extrapolation) 
at leaf nodes of the decision forest. However, this will 
increase the cost of posterior inference. Second, MFs 
choose splits independent of the labels; hence irrelevant 
features can hurt predictive performance [16]; in the 
batch setting, one can apply feature selection to filter 
or down weight the irrelevant features. 

6 Discussion 

We developed a novel and scalable methodology for re¬ 
gression based on Mondrian forests that provides both 
good predictive accuracy as well as sensible estimates of 
uncertainty. These uncertainty estimates are important 
in a range of application areas including probabilistic 
numerics, Bayesian optimization and planning. Using 
a large-scale regression application on flight delay data, 
we demonstrate that our proposed regression framework 
can provide both state-of-the-art RMSE and estimates 
of uncertainty as compared to recent scalable GP ap¬ 
proximations. We demonstrate that Mondrian forests 
deliver better-calibrated uncertainty estimates than 
existing decision forests, especially in regions far away 
from the training data. Since Mondrian forests deliver 
good uncertainty estimates and can be trained online 
efficiently, they seem promising for applications such 
as Bayesian optimization and reinforcement learning. 
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Mondrian Forests for Large-Scale Regression when Uncertainty Mat¬ 
ters: Supplementary material 

A Pseudocode for online learning and prediction 

The online updates are shown in Algorithms 3 and 4. The prediction step is detailed in Algorithm 5. 


Algorithm 3 ExtendMondrianTree(T, V, min samples split) 


1: Input: Tree T = (T, 5, £, r), new training instance T> = (x, y) 

2: ExtendMondrianBlock(T, e, T>, min_samples_split) 

> Algorithm f 



Algorithm 4 ExtendMondrianBlock(T, j,V, min samples split) 



1: Set e e = max(£j — a;, 0) and e“ = max(a: — uj,0) > e e = e“ = if x € Bf 

2: Sample E from exponential distribution with rate LU e d + eff) 

3: if T parent (j) + E < Tj then > introduce new parent for node j 

4: Sample split dimension 6, choosing d with probability proportional to 

5: Sample split location £ uniformly from interval [Uj S , xf\ if x$ > u* 5 else [x 5 ,£fj s ]. 

6 : Insert a new node j just above node j in the tree, and a new leaf j", sibling to j , where 

7: 5j = 6, Tj = Tp arent (j) +E, = min x), uj = max(uj, x) 

8: j" = left(J) iff x S} < fj 

9: SampleMondrianBlock(j", V, min_samples_split) 

10: else 

11: Update £j 4— min (£J, x), uj <— max(uj, x) > update extent of node j 

12: if j leaves(T) then > return if j is a leaf node, else recurse down the tree 

13: if xsj < fj then child(j) = left(j) else child(j) = right(j) 

14: ExtendMondrianBlock(T, child(j), V, min_samples_split) c> recurse on child containing V 


Algorithm 5 Predict(T, x) 

1 : > Description of prediction using a Mondrian tree given by (3). 

2 : > The predictive mean, predictive variance and NLPD computation are not shown, but they can be computed 

easily during the top-down pass using the weights Wj and posterior moments nij,Vj at node j. 

3. Initialize J — 6 and PlMotSeparatedYet — 1 

4: while True do 

5: Set A j = Tj - T parent (j) and r]j(x) = Ed( max 0Ui ~ °) + max (^d - x d, 0)) 

6: Set Pj(x) = 1 — exp(— Ajijj(x)) 

7: if Pj{x) > 0 then 

8: Wj — PlMotSeparatedYet Pj () 

9: if j £ leaves(T) then 

10: Wj —PNotSeparatedYet (1 Pj ( x ) ) 

11: return 

12: else 

13: PlMotSeparatedYet 1 PNotSeparatedYet(l Pj('^)) 

14: if x§ < f.j then j 4 — I eft ( j ) else j ■£- right(j) > recurse to the child where x lies 


B Choosing the hyperparameters 

In this appendix, we give more details on how we choose the hyper parameters 0 = {/i/r, 71 , 72 , cr'y}. For simplicity, 
we used the same values of these hyperparameters for all the trees; it is possible to optimize these parameters for 
each tree independently. 
















Mondrian Forests for Large-Scale Regression when Uncertainty Matters 


We optimize the product of label marginals , integrating out fi for each label individually, i.e., 

q(Y\6,T)= AT(y n \(J,H,4>j ~ <?Wnt( e ) + CT y)- 

j’Gleaves(T) n(zN(j ) 


Since Tj = oo at the leaf nodes, we have 

<t>j - 0parent(e) = 7W(72 Tj) - 7 lfj( 7 2 0 ) 

= 7l( cr ( 00 ) - cr(0)) 

-= 21 

2 ' 

If the noise variance is known, < 7 / can be set to the appropriate value. In our case, the noise variance is unknown; 
hence, we parametrize Uy as 71 /K and set K = min(2000, 2N) to ensure that the noise variance <jy is a non-zero 
fraction of the total variance 7 i /2 + 71 /A'. 

We maximize q(Y\6,T) over /i#, 71 , and K, leading to 

n 

n 

Note that we could have instead performed gradient descent on the actual marginal likelihood produced as a 
byproduct of belief propagation. It would be interesting to investigate this. 

The likelihood q(Y\6,T) does not depend on 72 , and so we cannot choose 72 by optimizing it. We know, however, 
that r increases with N. Moreover, Lakshminarayanan et al. [16] observed that the average tree depths were 2-3 
times log 2 (iV) in practice. We therefore pre-process the training data to lie in [0, 1]-° and set 72 = 9Q 1( ^ N since 
(i) r increases with tree depth and the tree depth is 0{ log 2 N) assuming balanced trees and (ii) r is inversely 
proportional to D. In Appendix C, we describe a fast approximation which does not involve estimation of 71 , 72 - 

C Fast approximation to message passing and hyperparameter estimation 

In Section 5.3, we suggested a fast C>(logn) approximation to exact message passing which costs 0{n). Under 
this approximation, the Gaussian posterior at each node is approximated by a Gaussian distribution whose mean 
and variance are given by the empirical mean and variance of the data points at that node. This approximation 
is better suited for online applications since adding a new data point involves just updating mean and variance 
for all the nodes along the path from root to a leaf. Another advantage of this approximation is that we only 
need to set the noise variance cU and do not need to set the hyper-parameters {/r#, 71 , 72 }- 

Since our initial publication, we have learnt that this Gaussian posterior approximation is similar to a random 
forest modification independently proposed in Hutter et al. [14, §4.3.2], I 11 [14], each tree outputs a predictive 
mean and variance equal to the empirical mean and variance of the labels at the leaf node of the decision tree. 
However, there is an additional level of smoothing in MFs that is not present in [14]. Specifically, the prediction 
from a Mondrian tree, described in (3), is a weighted mixture of predictions from nodes along the path from the 
root to the leaf. Moreover, the weights account for the distance between the test point from the training data, 
thereby ensuring that the predictions shrink to the prior as we move farther away from the training data. 


Ph = 


A 1 i 
7l( 2 + K ] - 




